Goto

Collaborating Authors

 Peñuelas



Bench4KE: Benchmarking Automated Competency Question Generation

Lippolis, Anna Sofia, Ragagni, Minh Davide, Ciancarini, Paolo, Nuzzolese, Andrea Giovanni, Presutti, Valentina

arXiv.org Artificial Intelligence

The availability of Large Language Models (LLMs) presents a unique opportunity to reinvigorate research on Knowledge Engineering (KE) automation. This trend is already evident in recent efforts developing LLM-based methods and tools for the automatic generation of Competency Questions (CQs), natural language questions used by ontology engineers to define the functional requirements of an ontology. However, the evaluation of these tools lacks standardization. This undermines the methodological rigor and hinders the replication and comparison of results. To address this gap, we introduce Bench4KE, an extensible API-based benchmarking system for KE automation. The presented release focuses on evaluating tools that generate CQs automatically. Bench4KE provides a curated gold standard consisting of CQ datasets from 17 real-world ontology engineering projects and uses a suite of similarity metrics to assess the quality of the CQs generated. We present a comparative analysis of 6 recent CQ generation systems, which are based on LLMs, establishing a baseline for future research. Bench4KE is also designed to accommodate additional KE automation tasks, such as SPARQL query generation, ontology testing and drafting. Code and datasets are publicly available under the Apache 2.0 license.


SynBullying: A Multi LLM Synthetic Conversational Dataset for Cyberbullying Detection

Kazemi, Arefeh, Qadeer, Hamza, Wagner, Joachim, Hosseini, Hossein, Kalaivendan, Sri Balaaji Natarajan, Davis, Brian

arXiv.org Artificial Intelligence

We introduce SynBullying, a synthetic multi-LLM conversational dataset for studying and detecting cyberbullying (CB). SynBullying provides a scalable and ethically safe alternative to human data collection by leveraging large language models (LLMs) to simulate realistic bullying interactions. The dataset offers (i) conversational structure, capturing multi-turn exchanges rather than isolated posts; (ii) context-aware annotations, where harmfulness is assessed within the conversational flow considering context, intent, and discourse dynamics; and (iii) fine-grained labeling, covering various CB categories for detailed linguistic and behavioral analysis. We evaluate SynBullying across five dimensions, including conversational structure, lexical patterns, sentiment/toxicity, role dynamics, harm intensity, and CB-type distribution. We further examine its utility by testing its performance as standalone training data and as an augmentation source for CB classification.


"I Can See Forever!": Evaluating Real-time VideoLLMs for Assisting Individuals with Visual Impairments

Zhang, Ziyi, Sun, Zhen, Zhang, Zongmin, Peng, Zifan, Zhao, Yuemeng, Wang, Zichun, Luo, Zeren, Zuo, Ruiting, He, Xinlei

arXiv.org Artificial Intelligence

The visually impaired population faces significant challenges in daily activities. While prior works employ vision language models for assistance, most focus on static content and cannot address real-time perception needs in complex environments. Recent VideoLLMs enable real-time vision and speech interaction, offering promising potential for assistive tasks. In this work, we conduct the first study evaluating their effectiveness in supporting daily life for visually impaired individuals. We first conducted a user survey with visually impaired participants to design the benchmark VisAssistDaily for daily life evaluation. Using VisAssistDaily, we evaluate popular VideoLLMs and find GPT-4o achieves the highest task success rate. We further conduct a user study to reveal concerns about hazard perception. To address this, we propose SafeVid, an environment-awareness dataset, and fine-tune VITA-1.5, improving risk recognition accuracy from 25.00% to 76.00%.We hope this work provides valuable insights and inspiration for future research in this field.


Generative AI for Self-Adaptive Systems: State of the Art and Research Roadmap

Li, Jialong, Zhang, Mingyue, Li, Nianyu, Weyns, Danny, Jin, Zhi, Tei, Kenji

arXiv.org Artificial Intelligence

Self-adaptive systems (SASs) are designed to handle changes and uncertainties through a feedback loop with four core functionalities: monitoring, analyzing, planning, and execution. Recently, generative artificial intelligence (GenAI), especially the area of large language models, has shown impressive performance in data comprehension and logical reasoning. These capabilities are highly aligned with the functionalities required in SASs, suggesting a strong potential to employ GenAI to enhance SASs. However, the specific benefits and challenges of employing GenAI in SASs remain unclear. Yet, providing a comprehensive understanding of these benefits and challenges is complex due to several reasons: limited publications in the SAS field, the technological and application diversity within SASs, and the rapid evolution of GenAI technologies. To that end, this paper aims to provide researchers and practitioners a comprehensive snapshot that outlines the potential benefits and challenges of employing GenAI's within SAS. Specifically, we gather, filter, and analyze literature from four distinct research fields and organize them into two main categories to potential benefits: (i) enhancements to the autonomy of SASs centered around the specific functions of the MAPE-K feedback loop, and (ii) improvements in the interaction between humans and SASs within human-on-the-loop settings. From our study, we outline a research roadmap that highlights the challenges of integrating GenAI into SASs. The roadmap starts with outlining key research challenges that need to be tackled to exploit the potential for applying GenAI in the field of SAS. The roadmap concludes with a practical reflection, elaborating on current shortcomings of GenAI and proposing possible mitigation strategies.


Large Language Models for Software Engineering: A Reproducibility Crisis

Siddiq, Mohammed Latif, Islam-Gomes, Arvin, Sekerak, Natalie, Santos, Joanna C. S.

arXiv.org Artificial Intelligence

Reproducibility is a cornerstone of scientific progress, yet its state in large language model (LLM)-based software engineering (SE) research remains poorly understood. This paper presents the first large-scale, empirical study of reproducibility practices in LLM-for-SE research. We systematically mined and analyzed 640 papers published between 2017 and 2025 across premier software engineering, machine learning, and natural language processing venues, extracting structured metadata from publications, repositories, and documentation. Guided by four research questions, we examine (i) the prevalence of reproducibility smells, (ii) how reproducibility has evolved over time, (iii) whether artifact evaluation badges reliably reflect reproducibility quality, and (iv) how publication venues influence transparency practices. Using a taxonomy of seven smell categories: Code and Execution, Data, Documentation, Environment and Tooling, Versioning, Model, and Access and Legal, we manually annotated all papers and associated artifacts. Our analysis reveals persistent gaps in artifact availability, environment specification, versioning rigor, and documentation clarity, despite modest improvements in recent years and increased adoption of artifact evaluation processes at top SE venues. Notably, we find that badges often signal artifact presence but do not consistently guarantee execution fidelity or long-term reproducibility. Motivated by these findings, we provide actionable recommendations to mitigate reproducibility smells and introduce a Reproducibility Maturity Model (RMM) to move beyond binary artifact certification toward multi-dimensional, progressive evaluation of reproducibility rigor.


Improving LLM-based Ontology Matching with fine-tuning on synthetic data

Sousa, Guilherme, Lima, Rinaldo, Trojahn, Cassia

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are increasingly being integrated into various components of Ontology Matching pipelines. This paper investigates the capability of LLMs to perform ontology matching directly on ontology modules and generate the corresponding alignments. Furthermore, it is explored how a dedicated fine-tuning strategy can enhance the model's matching performance in a zero-shot setting. The proposed method incorporates a search space reduction technique to select relevant subsets from both source and target ontologies, which are then used to automatically construct prompts. Recognizing the scarcity of reference alignments for training, a novel LLM-based approach is introduced for generating a synthetic dataset. This process creates a corpus of ontology submodule pairs and their corresponding reference alignments, specifically designed to fine-tune an LLM for the ontology matching task. The proposed approach was evaluated on the Conference, Geolink, Enslaved, Taxon, and Hydrography datasets from the OAEI complex track. The results demonstrate that the LLM fine-tuned on the synthetically generated data exhibits superior performance compared to the non-fine-tuned base model. The key contribution is a strategy that combines automatic dataset generation with fine-tuning to effectively adapt LLMs for ontology matching tasks.


Minimizing Hyperbolic Embedding Distortion with LLM-Guided Hierarchy Restructuring

Ayoughi, Melika, Mettes, Pascal, Groth, Paul

arXiv.org Artificial Intelligence

Hyperbolic geometry is an effective geometry for embedding hierarchical data structures. Hyperbolic learning has therefore become increasingly prominent in machine learning applications where data is hierarchically organized or governed by hierarchical semantics, ranging from recommendation systems to computer vision. The quality of hyperbolic embeddings is tightly coupled to the structure of the input hierarchy, which is often derived from knowledge graphs or ontologies. Recent work has uncovered that for an optimal hyperbolic embedding, a high branching factor and single inheritance are key, while embedding algorithms are robust to imbalance and hierarchy size. To assist knowledge engineers in reorganizing hierarchical knowledge, this paper investigates whether Large Language Models (LLMs) have the ability to automatically restructure hierarchies to meet these criteria. We propose a prompt-based approach to transform existing hierarchies using LLMs, guided by known desiderata for hyperbolic embeddings. Experiments on 16 diverse hierarchies show that LLM-restructured hierarchies consistently yield higher-quality hyperbolic embeddings across several standard embedding quality metrics. Moreover, we show how LLM-guided hierarchy restructuring enables explainable reorganizations, providing justifications to knowledge engineers.


Mind the Gap: Aligning Knowledge Bases with User Needs to Enhance Mental Health Retrieval

Chan, Amanda, Liu, James Jiayu, Kai, He, Kampman, Onno P.

arXiv.org Artificial Intelligence

Access to reliable mental health information is vital for early help-seeking, yet expanding knowledge bases is resource-intensive and often misaligned with user needs. This results in poor performance of retrieval systems when presented concerns are not covered or expressed in informal or contextualized language. We present an AI-based gap-informed framework for corpus augmentation that authentically identifies underrepresented topics (gaps) by overlaying naturalistic user data such as forum posts in order to prioritize expansions based on coverage and usefulness. In a case study, we compare Directed (gap-informed augmentations) with Non-Directed augmentation (random additions), evaluating the relevance and usefulness of retrieved information across four retrieval-augmented generation (RAG) pipelines. Directed augmentation achieved near-optimal performance with modest expansions--requiring only a 42% increase for Query Transformation, 74% for Reranking and Hierarchical, and 318% for Baseline--to reach ~95% of the performance of an exhaustive reference corpus. In contrast, Non-Directed augmentation required substantially larger and thus practically infeasible expansions to achieve comparable performance (232%, 318%, 403%, and 763%, respectively). These results show that strategically targeted corpus growth can reduce content creation demands while sustaining high retrieval and provision quality, offering a scalable approach for building trusted health information repositories and supporting generative AI applications in high-stakes domains.


Computational frame analysis revisited: On LLMs for studying news coverage

Kunjar, Sharaj, Smith, Alyssa Hasegawa, Mckenzie, Tyler R, Mohbe, Rushali, Scarpino, Samuel V, Welles, Brooke Foucault

arXiv.org Artificial Intelligence

Computational approaches have previously shown various promises and pitfalls when it comes to the reliable identification of media frames. Generative LLMs like GPT and Claude are increasingly being used as content analytical tools, but how effective are they for frame analysis? We address this question by systematically evaluating them against their computational predecessors: bag-of-words models and encoder-only transformers; and traditional manual coding procedures. Our analysis rests on a novel gold standard dataset that we inductively and iteratively developed through the study, investigating six months of news coverage of the US Mpox epidemic of 2022. While we discover some potential applications for generative LLMs, we demonstrate that they were consistently outperformed by manual coders, and in some instances, by smaller language models. Some form of human validation was always necessary to determine appropriate model choice. Additionally, by examining how the suitability of various approaches depended on the nature of different tasks that were part of our frame analytical workflow, we provide insights as to how researchers may leverage the complementarity of these approaches to use them in tandem. We conclude by endorsing a methodologically pluralistic approach and put forth a roadmap for computational frame analysis for researchers going forward.